{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 使用成员推理测试模型安全性\n",
    "`Linux` `Ascend` `GPU` `CPU` `模型评测` `企业` `高级`\n",
    "\n",
    "作者：MindSpore团队、[丁一超](https://gitee.com/JeffDing890430)\n",
    "\n",
    "## 概述\n",
    "\n",
    "成员推理是一种推测用户隐私数据的方法。隐私指的是单个用户的某些属性，一旦泄露可能会造成人身损害、名誉损害等后果。通常情况下，用户的隐私数据会作保密处理，但我们可以利用非敏感信息来进行推测。如果我们知道了某个私人俱乐部的成员都喜欢戴紫色墨镜、穿红色皮鞋，那么我们遇到一个戴紫色墨镜且穿红色皮鞋（非敏感信息）的人，就可以推断他/她很可能是这个私人俱乐部的成员（敏感信息）。这就是成员推理。\n",
    "\n",
    "机器学习/深度学习的成员推理(MembershipInference)，指的是攻击者拥有模型的部分访问权限(黑盒、灰盒或白盒)，能够获取到模型的输出、结构或参数等部分或全部信息，并基于这些信息推断某个样本是否属于模型的训练集。利用成员推理，我们可以评估机器学习/深度学习模型的隐私数据安全。如果在成员推理下能正确识别出60%+的样本，那么我们认为该模型存在隐私数据泄露风险。\n",
    "\n",
    "这里以VGG16模型，CIFAR-100数据集为例，说明如何使用MembershipInference进行模型隐私安全评估。本教程使用预训练的模型参数进行演示，这里仅给出模型结构、参数设置和数据集预处理方式。\n",
    "\n",
    ">本例面向Ascend 910处理器，您可以在这里下载完整的样例代码：\n",
    ">\n",
    "><https://gitee.com/mindspore/mindarmour/blob/master/examples/privacy/membership_inference/example_vgg_cifar.py>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 实现阶段\n",
    "### 安装MindArmour"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.2.0-rc1/MindArmour/x86_64/mindarmour-1.2.0rc1-cp37-cp37m-linux_x86_64.whl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注：本次实验使用的平台为MindSpore1.2.0-rc1，所以MindArmour也是用对应的1.2.0-rc1，如果使用的是1.2.1的，只需将命令中的1.2.0-rc1替换为1.2.1。  \n",
    "\n",
    "**MindArmour安装文档参考：<https://gitee.com/mindspore/mindarmour/blob/master/README_CN.md>**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 导入需要的库文件\n",
    "\n",
    "#### 引入相关包\n",
    "\n",
    "下面是我们需要的公共模块、MindSpore相关模块和MembershipInference特性模块，以及配置日志标签和日志等级。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import math\n",
    "import argparse\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "import mindspore.nn as nn\n",
    "from mindspore import Model, load_param_into_net, load_checkpoint\n",
    "from mindspore import dtype as mstype\n",
    "from mindspore.common import initializer as init\n",
    "from mindspore.common.initializer import initializer\n",
    "import mindspore.dataset as de\n",
    "import mindspore.dataset.transforms.c_transforms as C\n",
    "import mindspore.dataset.vision.c_transforms as vision\n",
    "from mindarmour import MembershipInference\n",
    "from mindarmour.utils import LogUtil\n",
    "\n",
    "LOGGER = LogUtil.get_instance()\n",
    "TAG = \"MembershipInference_test\"\n",
    "LOGGER.set_level(\"INFO\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 加载数据集\n",
    "\n",
    "这里采用的是CIFAR-100数据集，您也可以采用自己的数据集，但要保证传入的数据仅有两项属性\"image\"和\"label\"。\n",
    "\n",
    "数据集：CIFAR-100 下载地址：[链接](http://www.cs.toronto.edu/~kriz/cifar-100-binary.tar.gz)\n",
    "\n",
    "也可在Jupyter Notebook中执行下面代码完成下载解压到指定文件夹。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "./cifar100\n",
      "├── test\n",
      "│   └── test.bin\n",
      "└── train\n",
      "    ├── fine_label_names.txt\n",
      "    └── train.bin\n",
      "\n",
      "2 directories, 3 files\n"
     ]
    }
   ],
   "source": [
    "!mkdir -p ./cifar100/train ./cifar100/test\n",
    "!wget https://mindspore-website.obs.myhuaweicloud.com/notebook/datasets/cifar-100-binary.tar.gz\n",
    "!tar -zxvf cifar-100-binary.tar.gz\n",
    "!mv -f ./cifar-100-binary/train.bin ./cifar-100-binary/fine_label_names.txt ./cifar100/train/\n",
    "!mv ./cifar-100-binary/test.bin ./cifar100/test/\n",
    "!tree ./cifar100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ckpt文件参考MindSpore代码仓库中ModelZoo中VGG16代码将cifar10的代码修改为cifar100的代码进行训练产生\n",
    "\n",
    "训练代码下载：[链接](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/vgg16)\n",
    "\n",
    "训练完成的ckpt文件下载地址（百度网盘提取码: jits）： [链接](https://pan.baidu.com/s/10jeLzJ1Sl23gjoc-AZd-Ng) \n",
    "\n",
    "在Jupyter Notebook中可执行下面代码下载训练完成的ckpt，并放置到指定位置。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://mindspore-website.obs.myhuaweicloud.com/notebook/datasets/0-70_781.ckpt\n",
    "!mkdir -p ./ckpt\n",
    "!mv -f ./0-70_781.ckpt ./ckpt/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 构建数据增强方法\n",
    "\n",
    "使用MindSpore提供的cifar100数据集处理接口，将`.bin`数据集读取出来，并通过归一化和标准化等数据增强操作将数据集处理成适合`vgg_net`训练的数据集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate CIFAR-100 data.\n",
    "def vgg_create_dataset100(data_home, image_size, batch_size, rank_id=0, rank_size=1, repeat_num=1,\n",
    "                          training=True, num_samples=None, shuffle=True):\n",
    "    \"\"\"Data operations.\"\"\"\n",
    "    de.config.set_seed(1)\n",
    "    data_dir = os.path.join(data_home, \"train\")\n",
    "    if not training:\n",
    "        data_dir = os.path.join(data_home, \"test\")\n",
    "\n",
    "    if num_samples is not None:\n",
    "        data_set = de.Cifar100Dataset(data_dir, num_shards=rank_size, shard_id=rank_id,\n",
    "                                      num_samples=num_samples, shuffle=shuffle)\n",
    "    else:\n",
    "        data_set = de.Cifar100Dataset(data_dir, num_shards=rank_size, shard_id=rank_id)\n",
    "\n",
    "    input_columns = [\"fine_label\"]\n",
    "    output_columns = [\"label\"]\n",
    "    data_set = data_set.rename(input_columns=input_columns, output_columns=output_columns)\n",
    "    data_set = data_set.project([\"image\", \"label\"])\n",
    "\n",
    "    rescale = 1.0 / 255.0\n",
    "    shift = 0.0\n",
    "\n",
    "    # Define map operations.\n",
    "    random_crop_op = vision.RandomCrop((32, 32), (4, 4, 4, 4))  # padding_mode default CONSTANT.\n",
    "    random_horizontal_op = vision.RandomHorizontalFlip()\n",
    "    resize_op = vision.Resize(image_size)  # interpolation default BILINEAR.\n",
    "    rescale_op = vision.Rescale(rescale, shift)\n",
    "    normalize_op = vision.Normalize((0.4465, 0.4822, 0.4914), (0.2010, 0.1994, 0.2023))\n",
    "    changeswap_op = vision.HWC2CHW()\n",
    "    type_cast_op = C.TypeCast(mstype.int32)\n",
    "\n",
    "    c_trans = []\n",
    "    if training:\n",
    "        c_trans = [random_crop_op, random_horizontal_op]\n",
    "    c_trans += [resize_op, rescale_op, normalize_op,\n",
    "                changeswap_op]\n",
    "\n",
    "    # Apply map operations on images.\n",
    "    data_set = data_set.map(operations=type_cast_op, input_columns=\"label\")\n",
    "    data_set = data_set.map(operations=c_trans, input_columns=\"image\")\n",
    "\n",
    "    # Apply batch operations.\n",
    "    data_set = data_set.batch(batch_size=batch_size, drop_remainder=True)\n",
    "    # Apply repeat operations.\n",
    "    data_set = data_set.repeat(repeat_num)\n",
    "\n",
    "    return data_set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 建立模型\n",
    "\n",
    "这里以VGG16模型为例，您也可以替换为自己的模型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def _make_layer(base, args, batch_norm):\n",
    "    \"\"\"Make stage network of VGG.\"\"\"\n",
    "    layers = []\n",
    "    in_channels = 3\n",
    "    for v in base:\n",
    "        if v == 'M':\n",
    "            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]\n",
    "        else:\n",
    "            conv2d = nn.Conv2d(in_channels=in_channels,\n",
    "                               out_channels=v,\n",
    "                               kernel_size=3,\n",
    "                               padding=args.padding,\n",
    "                               pad_mode=args.pad_mode,\n",
    "                               has_bias=args.has_bias,\n",
    "                               weight_init='XavierUniform')\n",
    "            if batch_norm:\n",
    "                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU()]\n",
    "            else:\n",
    "                layers += [conv2d, nn.ReLU()]\n",
    "            in_channels = v\n",
    "    return nn.SequentialCell(layers)\n",
    "\n",
    "\n",
    "class Vgg(nn.Cell):\n",
    "    \"\"\"\n",
    "    VGG network definition.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, base, num_classes=1000, batch_norm=False, batch_size=1, args=None, phase=\"train\"):\n",
    "        super(Vgg, self).__init__()\n",
    "        _ = batch_size\n",
    "        self.layers = _make_layer(base, args, batch_norm=batch_norm)\n",
    "        self.flatten = nn.Flatten()\n",
    "        dropout_ratio = 0.5\n",
    "        if not args.has_dropout or phase == \"test\":\n",
    "            dropout_ratio = 1.0\n",
    "        self.classifier = nn.SequentialCell([\n",
    "            nn.Dense(512*7*7, 4096),\n",
    "            nn.ReLU(),\n",
    "            nn.Dropout(dropout_ratio),\n",
    "            nn.Dense(4096, 4096),\n",
    "            nn.ReLU(),\n",
    "            nn.Dropout(dropout_ratio),\n",
    "            nn.Dense(4096, num_classes)])\n",
    "\n",
    "    def construct(self, x):\n",
    "        x = self.layers(x)\n",
    "        x = self.flatten(x)\n",
    "        x = self.classifier(x)\n",
    "        return x\n",
    "\n",
    "\n",
    "base16 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M']\n",
    "\n",
    "\n",
    "def vgg16(num_classes=1000, args=None, phase=\"train\"):\n",
    "    net = Vgg(base16, num_classes=num_classes, args=args, batch_norm=args.batch_norm, phase=phase)\n",
    "    return net"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 运用MembershipInference进行隐私安全评估\n",
    "\n",
    "1. 构建VGG16模型并加载参数文件。\n",
    "\n",
    "    这里直接加载预训练完成的VGG16参数配置，您也可以使用如上的网络自行训练。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load parameter\n",
    "parser = argparse.ArgumentParser(\"main case arg parser.\")\n",
    "args = parser.parse_args(args=[])\n",
    "args.batch_norm = True\n",
    "args.has_dropout = False\n",
    "args.has_bias = False\n",
    "args.padding = 0\n",
    "args.pad_mode = \"same\"\n",
    "args.weight_decay = 5e-4\n",
    "args.loss_scale = 1.0\n",
    "args.data_path = \"./cifar100\"\n",
    "args.pre_trained = \"./ckpt/0-70_781.ckpt\"\n",
    "args.device_target = \"Ascend\"\n",
    "\n",
    "# Load the pretrained model.\n",
    "net = vgg16(num_classes=100, args=args)\n",
    "loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True)\n",
    "opt = nn.Momentum(params=net.trainable_params(), learning_rate=0.1, momentum=0.9,\n",
    "                  weight_decay=args.weight_decay, loss_scale=args.loss_scale)\n",
    "load_param_into_net(net, load_checkpoint(args.pre_trained))\n",
    "model = Model(network=net, loss_fn=loss, optimizer=opt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 加载CIFAR-100数据集，按8:2分割为成员推理模型的训练集和测试集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] MA(27840:140583037941568,MainProcess):2021-04-08 16:10:33,472 [<ipython-input-8-596179693c5b>:8] [MembershipInference_test] Data loading completed.\n"
     ]
    }
   ],
   "source": [
    "train_dataset = vgg_create_dataset100(data_home=args.data_path, image_size=(224, 224),\n",
    "                                      batch_size=64, num_samples=5000, shuffle=False)\n",
    "test_dataset = vgg_create_dataset100(data_home=args.data_path, image_size=(224, 224),\n",
    "                                     batch_size=64, num_samples=5000, shuffle=False, training=False)\n",
    "train_train, eval_train = train_dataset.split([0.8, 0.2])\n",
    "train_test, eval_test = test_dataset.split([0.8, 0.2])\n",
    "msg = \"Data loading completed.\"\n",
    "LOGGER.info(TAG, msg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. 配置推理参数和评估参数\n",
    "\n",
    "    设置用于成员推理的方法和参数。目前支持的推理方法有：KNN、LR、MLPClassifier和RandomForestClassifier。推理参数数据类型使用list，各个方法使用key为\"method\"和\"params\"的字典表示。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "config = [\n",
    "            {\n",
    "                \"method\": \"lr\",\n",
    "                \"params\": {\n",
    "                    \"C\": np.logspace(-4, 2, 10)\n",
    "                }\n",
    "            },\n",
    "         {\n",
    "                \"method\": \"knn\",\n",
    "                \"params\": {\n",
    "                    \"n_neighbors\": [3, 5, 7]\n",
    "                }\n",
    "            },\n",
    "            {\n",
    "                \"method\": \"mlp\",\n",
    "                \"params\": {\n",
    "                    \"hidden_layer_sizes\": [(64,), (32, 32)],\n",
    "                    \"solver\": [\"adam\"],\n",
    "                    \"alpha\": [0.0001, 0.001, 0.01]\n",
    "                }\n",
    "            },\n",
    "            {\n",
    "                \"method\": \"rf\",\n",
    "                \"params\": {\n",
    "                    \"n_estimators\": [100],\n",
    "                    \"max_features\": [\"auto\", \"sqrt\"],\n",
    "                    \"max_depth\": [5, 10, 20, None],\n",
    "                    \"min_samples_split\": [2, 5, 10],\n",
    "                    \"min_samples_leaf\": [1, 2, 4]\n",
    "                }\n",
    "            }\n",
    "        ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们约定标签为训练集的是正类，标签为测试集的是负类。设置评价指标，目前支持3种评价指标。包括：\n",
    "* 准确率：accuracy，正确推理的数量占全体样本中的比例。\n",
    "* 精确率：precision，正确推理的正类样本占所有推理为正类中的比例。\n",
    "* 召回率：recall，正确推理的正类样本占全体正类样本的比例。在样本数量足够大时，如果上述指标均大于0.6，我们认为目标模型就存在隐私泄露的风险。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "metrics = [\"precision\", \"accuracy\", \"recall\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. 训练成员推理模型，并给出评估结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] MA(27840:140583037941568,MainProcess):2021-04-08 16:12:20,029 [<ipython-input-11-ec12f5d4774e>:5] [MembershipInference_test] Membership inference model training completed.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Method: lr, {'precision': 0.6514925373134328, 'recall': 0.8525390625, 'accuracy': 0.6982421875}\n",
      "Method: knn, {'precision': 0.5489396411092985, 'recall': 0.6572265625, 'accuracy': 0.55859375}\n",
      "Method: mlp, {'precision': 0.6491739552964043, 'recall': 0.65234375, 'accuracy': 0.64990234375}\n",
      "Method: rf, {'precision': 0.6684574059861857, 'recall': 0.8505859375, 'accuracy': 0.71435546875}\n"
     ]
    }
   ],
   "source": [
    "inference = MembershipInference(model)                  # Get inference model.\n",
    "\n",
    "inference.train(train_train, train_test, config)        # Train inference model.\n",
    "msg = \"Membership inference model training completed.\"\n",
    "LOGGER.info(TAG, msg)\n",
    "\n",
    "result = inference.eval(eval_train, eval_test, metrics) # Eval metrics.\n",
    "count = len(config)\n",
    "for i in range(count):\n",
    "    print(\"Method: {}, {}\".format(config[i][\"method\"], result[i]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5. 实验结果"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "成员推理的指标如下所示，各数值均保留至小数点后四位。\n",
    "\n",
    "以第一行结果为例：在使用lr（逻辑回归分类）进行成员推理时，推理的准确率（accuracy）为0.69824，推理精确率（precision）为0.65149，正类样本召回率为0.85254，说明lr有69.8%的概率能正确分辨一个数据样本是否属于目标模型的训练数据集。在二分类任务下，指标表明成员推理是有效的，即该模型存在隐私泄露的风险。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 参考文献\n",
    "\n",
    "[1] [Shokri R , Stronati M , Song C , et al. Membership Inference Attacks against Machine Learning Models[J].](https://arxiv.org/abs/1610.05820v2)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "MindSpore-1.1.1",
   "language": "python",
   "name": "mindspore-1.1.1"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
